Fast Runtime Block Cyclic Data Redistribution on Multiprocessors
Abstract
Block cyclic distribution appears well suited to most linear algebra algorithms, and this type of data distribution was chosen for the ScaLAPACK library as well as for the HPF language. However, one has to choose a good compromise for the block size in order to achieve good computation and communication efficiency together with good load balancing. This choice depends heavily on each operation, so it is essential to be able to go from one block cyclic distribution to another very quickly. It is equally essential to be able to choose the right number of processors and the best grid shape for a given operation. We present here the data redistribution algorithms we implemented in the ScaLAPACK library in order to go from one block cyclic distribution on a grid to another one on another grid. A complexity study shows the efficiency of our solution. Timing results on the Intel Paragon and the Cray T3D corroborate our results.
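To make the redistribution problem concrete, the following is a minimal one-dimensional sketch, not the algorithm implemented in ScaLAPACK. It assumes an array of n elements distributed block cyclically with block size r over p processes, and computes which elements each source process would have to send to each destination process when moving to block size s over q processes. The function names `owner` and `redistribution_plan` are hypothetical and chosen only for illustration. On a two-dimensional grid the same mapping applies independently to rows and columns.

```python
from collections import defaultdict


def owner(i, block, nprocs):
    """Return (process, local index) of global element i under a
    one-dimensional block cyclic distribution with the given block size."""
    blk = i // block                       # global block number
    proc = blk % nprocs                    # blocks are dealt out round-robin
    local = (blk // nprocs) * block + i % block
    return proc, local


def redistribution_plan(n, r, p, s, q):
    """Map each (source, destination) process pair to the global indices
    that move from cyclic(r) over p processes to cyclic(s) over q processes.

    This brute-force scan over all n elements is for illustration only;
    it is not the scheduling strategy described in the paper.
    """
    plan = defaultdict(list)
    for i in range(n):
        src, _ = owner(i, r, p)
        dst, _ = owner(i, s, q)
        plan[(src, dst)].append(i)
    return dict(plan)


if __name__ == "__main__":
    # 12 elements: block size 2 on 2 processes -> block size 3 on 3 processes
    plan = redistribution_plan(n=12, r=2, p=2, s=3, q=3)
    for (src, dst), indices in sorted(plan.items()):
        print(f"P{src} -> Q{dst}: {indices}")
```

Even in this small case every source process sends to every destination process, which suggests why the cost of redistribution depends strongly on the block sizes and grid shapes involved.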
Similar resources
Runtime Array Redistribution in HPF Programs
This paper describes efficient algorithms for run-time array redistribution in HPF programs. We consider block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms are practical enough to be easily implemented in the runti...
Runtime Array Redistribution in HPF
This paper describes efficient algorithms for run-time array redistribution in HPF programs. We consider block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms are practical enough to be easily implemented in the runti...
A Generalized Processor Mapping Technique for Array Redistribution
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access in many parallel programs on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistribu...
An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Redistribution
Dynamic data distribution is used to enhance data locality and algorithm performance by reducing inter-processor communication in data parallel programs on distributed memory multi-computers. Since the exchange of data is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of ex...
Efficient FFT mapping on GPU for radar processing application: modeling and implementation
General-purpose multiprocessors (as, in our case, Intel IvyBridge and Intel Haswell) increasingly add GPU computing power to the former multicore architectures. When used for embedded applications (for us, Synthetic aperture radar) with intensive signal processing requirements, they must constantly compute convolution algorithms, such as the famous Fast Fourier Transform. Due to its "fractal" n...
Journal:
- J. Parallel Distrib. Comput.
Volume 45
Publication date 1997